BMC Medical Research Methodology
Springer Science and Business Media LLC
All preprints, ranked by how well they match BMC Medical Research Methodology's content profile, based on 43 papers previously published here. The average preprint has a 0.05% match score for this journal, so anything above that is already an above-average fit. Older preprints may already have been published elsewhere.
Guo, Y.; Strauss, V. Y.; Prieto-Alhambra, D.; Khalid, S.
Background: The surge of treatments for COVID-19 in the ongoing pandemic presents an exemplar scenario with low prevalence of a given treatment and high outcome risk. Motivated by this, we conducted a simulation study of treatment effect estimation in such scenarios. We compared the performance of two methods for addressing confounding when estimating treatment effects, disease risk scores (DRS) and propensity scores (PS), using different machine learning algorithms.

Methods: Monte Carlo simulated data covering 25 different scenarios of treatment prevalence, outcome risk, data complexity, and sample size were created. PS and DRS matching with a 1:1 ratio were applied with logistic regression with least absolute shrinkage and selection operator (LASSO) regularization, multilayer perceptron (MLP), and eXtreme Gradient Boosting (XGBoost). Estimation performance was evaluated using relative bias and corresponding confidence intervals.

Results: Bias in treatment effect estimation increased with decreasing treatment prevalence, regardless of matching method. DRS resulted in lower bias than PS when treatment prevalence was below 10%, under strong confounding and a nonlinear, nonadditive data setting. However, DRS did not outperform PS under a linear data setting or with a small sample size, even when treatment prevalence was below 10%. PS had comparable or lower bias than DRS when treatment prevalence was common or high (10%-50%). All three machine learning methods performed similarly, with LASSO and XGBoost yielding the lowest bias in some scenarios. Decreasing the sample size or adding nonlinearity and non-additivity to the data worsened the performance of both PS and DRS.

Conclusions: Under strong confounding with a large sample size, DRS reduced bias compared to PS in scenarios with low treatment prevalence (less than 10%), whilst PS was preferable for studying treatments with prevalence greater than 10%, regardless of outcome prevalence.

Key Messages:
- When handling nonlinear, nonadditive data with strong confounding, DRS estimated by machine learning methods outperforms PS in scenarios with low treatment prevalence (less than 10%).
- However, with linear data and a small sample size under strong confounding, we did not observe DRS outperforming PS, even when treatment prevalence was less than 10%.
- Our results suggest that PS performed better than DRS in tackling strong confounding when treatment prevalence was greater than 10%.
- A small sample size increased bias for both DRS and PS, and affected DRS more than PS.
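A minimal sketch of the two score-based matching approaches compared above, assuming scikit-learn; the simulated data, variable names, and 1:1 matching with replacement are illustrative assumptions, not the authors' code.

```python
# Sketch: 1:1 propensity score (PS) vs disease risk score (DRS) matching.
# Simulated data; matching is with replacement for brevity. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))                                   # confounders
treat = rng.binomial(1, 1 / (1 + np.exp(2.5 - X[:, 0])))      # low-prevalence treatment
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + X[:, 1] + 0.5 * treat))))

# PS: model P(treatment | confounders) on the full sample (LASSO-penalised).
ps = LogisticRegression(penalty="l1", solver="liblinear")
ps_score = ps.fit(X, treat).predict_proba(X)[:, 1]

# DRS: model P(outcome | confounders) among the *untreated*, predict for all.
drs = LogisticRegression(penalty="l1", solver="liblinear")
drs.fit(X[treat == 0], y[treat == 0])
drs_score = drs.predict_proba(X)[:, 1]

def match_1to1(score, treat):
    """Nearest-neighbour 1:1 matching of treated to untreated on a scalar score."""
    nn = NearestNeighbors(n_neighbors=1).fit(score[treat == 0].reshape(-1, 1))
    _, idx = nn.kneighbors(score[treat == 1].reshape(-1, 1))
    return np.flatnonzero(treat == 1), np.flatnonzero(treat == 0)[idx.ravel()]

for name, score in [("PS", ps_score), ("DRS", drs_score)]:
    t_idx, c_idx = match_1to1(score, treat)
    print(name, "matched risk difference:", round(y[t_idx].mean() - y[c_idx].mean(), 3))
```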
Oulhaj, A.; Ahmed, L. A.; Prattes, J.; Suliman, A.; Al Suwaidi, A.; Al-Rifai, R. H.; Sourij, H.; Van Keilegom, I.
Background: A plethora of studies on COVID-19 investigating mortality and recovery have used the Cox proportional hazards (Cox PH) model without taking into account the presence of competing risks. We investigate, through extensive simulations, the bias in estimating the hazard ratio (HR) and the absolute risk reduction (ARR) of death when competing risks are ignored, and suggest an alternative method.

Methods: We simulated a fictive clinical trial on COVID-19 mimicking studies investigating interventions such as hydroxychloroquine, remdesivir, or convalescent plasma. The outcome is time from randomization until death. Six scenarios for the effect of treatment on death and recovery were considered. The HR and the 28-day ARR of death were estimated using the Cox PH and the Fine and Gray (FG) models. Estimates were then compared with the true values, and the magnitude of misestimation was quantified.

Results: The Cox PH model misestimated the true HR and the 28-day ARR of death in the majority of scenarios. The magnitude of misestimation increased when recovery was faster and/or the chance of recovery was higher. In some scenarios, this model showed a harmful treatment effect when the true effect was beneficial. Estimates obtained from the FG model were all consistent and showed no misestimation or changes in direction.

Conclusion: There is a substantial risk of misleading results in COVID-19 research if recovery and death due to COVID-19 are not considered as competing risk events. We strongly recommend the use of a competing risk approach to re-analyze relevant published data that have used the Cox PH model.
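A toy illustration of the core pitfall, assuming the lifelines package (not the authors' code): with recovery treated as censoring, the cause-specific Cox HR for death can look null even though the 28-day cumulative incidence of death differs between arms. An Aalen-Johansen estimator stands in here for the Fine-Gray model, which additionally yields covariate effects on the subdistribution hazard.

```python
# Sketch, assuming lifelines: naive Cox (recovery handled as censoring) vs a
# competing-risks cumulative incidence estimate. Simulated, illustrative only.
import numpy as np
import pandas as pd
from lifelines import AalenJohansenFitter, CoxPHFitter

rng = np.random.default_rng(1)
n = 2000
treat = rng.binomial(1, 0.5, n)
t_death = rng.exponential(30.0, n)                          # death hazard: same in both arms
t_recov = rng.exponential(np.where(treat == 1, 5.0, 10.0))  # treated recover faster
time = np.minimum(np.minimum(t_death, t_recov), 28.0)
event = np.zeros(n, dtype=int)
event[(t_death <= t_recov) & (t_death <= 28)] = 1           # death observed
event[(t_recov < t_death) & (t_recov <= 28)] = 2            # recovery (competing event)

# Naive analysis: recovery coded as censoring -> cause-specific HR ~ 1 here.
df = pd.DataFrame({"time": time, "death": (event == 1).astype(int), "treat": treat})
cph = CoxPHFitter().fit(df, duration_col="time", event_col="death")
print("cause-specific HR:", round(cph.summary.loc["treat", "exp(coef)"], 2))

# Competing-risks view: the 28-day cumulative incidence of death differs by
# arm, because faster recovery removes treated patients from the risk of death.
for arm in (0, 1):
    ajf = AalenJohansenFitter().fit(time[treat == arm], event[treat == arm],
                                    event_of_interest=1)
    print("arm", arm, "28-day death risk:", round(ajf.cumulative_density_.iloc[-1, 0], 3))
```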
Mullaert, J.; Schmeller, S.; Austin, P. C.; Latouche, A.
When fitting competing risks regression models, a variety of variable selection methods exist, including backward selection on the subdistribution hazard, backward selection on the cause-specific hazards, and penalized methods. However, a benchmark study comparing these procedures has been lacking. We conducted an extensive simulation study to compare three variable selection procedures in terms of both model selection ability and predictive accuracy. 5120 datasets were simulated under various conditions aiming to be representative of real applications in clinical epidemiology. Results show that the backward selection procedure can lead to a high false discovery rate (FDR) because of implementation choices. Even in scenarios with a high number of events per variable (EPV), the true model was rarely identified by any of the tested procedures. Survival predictions were assessed with time-dependent AUC and showed similar performance for all methods. We also provide an application to real data from stem cell transplant patients in hematology. We conclude that identifying the true model in competing risks regression is a very difficult task, and suggest some recommendations to analysts: (1) report events per variable for the event type of interest, and (2) use multiple methods to deal with model uncertainty and avoid implementation pitfalls.
Choi, J.; Dekkers, O. M.; le Cessie, S.
In epidemiological research it is common to encounter measurements affected by medication use, such as blood pressure lowered by antihypertensive drugs. When one is interested in the relation between the variables unaffected by medication, ignoring medication use can cause bias. Several methods have been proposed, but the problem is often ignored or handled with generic methods, such as excluding individuals on medication or adjusting for medication use in the analysis. This study aimed to investigate methods for handling measurements affected by medication use when one is interested in the relation between the unaffected variables, and to provide guidance on how to optimally handle the problem. We focused on linear regression and distinguish between situations where the affected measurement is an exposure, a confounder, or an outcome. In the Netherlands Epidemiology of Obesity study and in several simulated settings, we compared generic and more advanced methods, such as substituting or adding a fixed value to the treated values, regression calibration, censored normal regression, Heckman's treatment model, and multiple imputation methods. We found that often-used methods such as adjusting for medication use could result in substantial bias, and that methods for handling medication use should be chosen cautiously.
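A small simulation in the spirit of the blood-pressure example, assuming statsmodels; the data-generating choices (a heterogeneous ~10 mmHg drug effect, treatment assigned by true BP) are illustrative assumptions, and which generic method is biased depends on exactly such choices, which is the paper's point.

```python
# Sketch: generic fixes for a medication-affected exposure in linear regression,
# simulated loosely after the blood-pressure example above (not the paper's code).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 100_000
sbp_true = rng.normal(140, 15, n)                                   # untreated ("true") BP
on_med = rng.binomial(1, 1 / (1 + np.exp(-(sbp_true - 150) / 5)))   # high BP -> treated
effect = rng.normal(10, 5, n)                                       # heterogeneous BP lowering
sbp_obs = sbp_true - effect * on_med                                # measured (treated) BP
y = 0.05 * sbp_true + rng.normal(0, 1, n)                           # outcome depends on true BP

def slope(X, y):
    return sm.OLS(y, sm.add_constant(X)).fit().params[1]

print("true slope:            0.050")
print("naive (ignore meds):  ", round(slope(sbp_obs, y), 3))
print("exclude treated:      ", round(slope(sbp_obs[on_med == 0], y[on_med == 0]), 3))
print("add mean effect (+10):", round(slope(sbp_obs + 10 * on_med, y), 3))
print("adjust for med use:   ", round(slope(np.column_stack([sbp_obs, on_med]), y), 3))
```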
Du, M.; Guo, Y.; Li, X.; Catala, M.; Prieto-Alhambra, D.
Background: Estimating causal effects in observational health data is challenging due to confounding by indication. Traditional approaches such as inverse probability of treatment weighting (IPTW) rely on correct model specification, which is difficult in high-dimensional settings. We implemented an offset-based double machine learning (Offset-DML) practical framework for estimating binary treatment effects on the log-odds scale using logistic regression.

Methods: We conducted a plasmode simulation study based on real-world clinical data, varying sample size (5,000, 10,000, 20,000) and outcome prevalence (5%, 10%, 20%) with 200 repetitions. We compared the performance of IPTW, stabilised IPTW, Offset-DML (with and without cross-fitting), and high-dimensional DML (HD-DML). Performance was measured with the following metrics: absolute bias, empirical standard error, and root mean square error relative to the true average causal effect.

Results: Across most scenarios, DML-based approaches outperformed IPTW methods in terms of bias and empirical standard error, particularly at larger sample sizes. Offset-DML showed comparable performance to HD-DML while avoiding the convergence issues observed with HD-DML in sparse data settings. All DML methods had overlapping confidence intervals in most scenarios.

Conclusion: Offset-DML is a practical and robust alternative for causal inference in high-dimensional health data. Future work should investigate extensions to other outcomes and diagnostics to assess confounding control.

Key messages:
- Double machine learning-based methods consistently outperform IPTW in bias and empirical standard error, particularly in large-sample and sparse-data scenarios.
- Offset double machine learning is a practical and robust binary causal effect estimation method in high-dimensional settings.
- Unlike high-dimensional double machine learning, the offset-based approach demonstrated consistent convergence across all scenarios, including those with low outcome prevalence and small sample sizes.
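A guess at the offset mechanic only, not the authors' Offset-DML algorithm: a cross-fitted machine-learning prediction of the control log-odds enters a logistic regression as a fixed offset, assuming scikit-learn and statsmodels.

```python
# Sketch of the offset mechanic: a cross-fitted ML prediction of the control
# log-odds enters a logistic regression as a fixed offset, so only the
# treatment coefficient remains to estimate. Not the authors' implementation.
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)
n = 10_000
X = rng.normal(size=(n, 10))
a = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))            # treatment
logit_y = X[:, 0] + np.sin(X[:, 1]) + 0.4 * a              # true log-odds effect 0.4
y = rng.binomial(1, 1 / (1 + np.exp(-logit_y)))

offset = np.zeros(n)
for tr, te in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # Nuisance model: outcome risk among the untreated, fit on the other fold.
    m = GradientBoostingClassifier().fit(X[tr][a[tr] == 0], y[tr][a[tr] == 0])
    p = np.clip(m.predict_proba(X[te])[:, 1], 1e-3, 1 - 1e-3)
    offset[te] = np.log(p / (1 - p))                        # cross-fitted log-odds

fit = sm.GLM(y, sm.add_constant(a.astype(float)),
             family=sm.families.Binomial(), offset=offset).fit()
print("log-odds treatment effect:", round(fit.params[1], 2))
```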
Heston, T. F.
Background: In biostatistics, assessing the fragility of research findings is crucial for understanding their clinical significance. This study focuses on the fragility index, unit fragility index, and relative risk index as measures of statistical fragility. The relative risk index quantifies the deviation of observed findings from therapeutic equivalence. In contrast, the fragility indices assess the susceptibility of p-values to change significance with minor alterations in outcomes within a 2×2 contingency table. While the fragility indices have intuitive appeal and have been widely applied, their behavior across a wide range of contingency tables has not been rigorously evaluated.

Methods: Using a Python program, a simulation approach was employed to generate random 2×2 contingency tables. All tables under consideration exhibited p-values < 0.05 according to Fisher's exact test. Subsequently, the fragility indices and the relative risk index were calculated. To account for sample size variations, fragility and risk quotients were also calculated. A correlation matrix assessed the collinearity between each metric and the p-value.

Results: The analysis included 2,000 contingency tables with cell counts ranging from 20 to 480. Notably, the formulas for calculating the fragility indices encountered limitations when cell counts approached zero or duplicate cell counts hindered standardized application. The correlation coefficients with p-values were as follows: unit fragility index (-0.806), fragility index (-0.802), fragility quotient (-0.715), unit fragility quotient (-0.695), relative risk index (-0.403), and relative risk quotient (-0.261).

Conclusion: In the context of p-values < 0.05, the fragility indices and their quotients correlated more strongly with the p-value than the relative risk index and quotient did. This implies that the fragility indices offer limited additional information beyond the p-value alone. In contrast, the relative risk index displays relative independence, suggesting that it provides meaningful insight into statistical fragility by assessing how far observed findings deviate from therapeutic equivalence, regardless of the p-value.
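A minimal sketch of one common fragility index definition, assuming SciPy; the paper's implementation and its handling of edge cases (zero or duplicate cell counts, noted above) may differ.

```python
# Sketch: fragility index of a significant 2x2 table -- the smallest number of
# non-event-to-event conversions in the lower-event arm that pushes Fisher's
# exact p to 0.05 or above. One common convention; edge cases vary by author.
from scipy.stats import fisher_exact

def fragility_index(a, b, c, d, alpha=0.05):
    """Rows: treatment (a events, b non-events) and control (c events, d non-events)."""
    if fisher_exact([[a, b], [c, d]])[1] >= alpha:
        return None                             # not significant to begin with
    flips = 0
    while b > 0:
        a, b, flips = a + 1, b - 1, flips + 1   # convert one non-event to an event
        if fisher_exact([[a, b], [c, d]])[1] >= alpha:
            return flips
    return None

print(fragility_index(5, 95, 20, 80))           # e.g. 5/100 vs 20/100 events
```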
Cardoso, P.; Dennis, J. M.; Bowden, J.; Shields, B.; McKinley, T.; MASTERMIND Consortium
Background: Missing data is a common problem in regression modelling. Much of the literature focuses on handling missing outcome variables, but there are also challenges with missing predictor information, particularly when building prediction models for use in practice.

Methods: We develop a flexible Bayesian approach for handling missing predictor information in regression models. For prediction, this provides practitioners with full posterior predictive distributions for both the missing predictor information and the outcome variable, conditional on the observed predictors. We apply our approach to a previously proposed treatment selection model for type 2 diabetes second-line therapies. Our approach combines a regression model and a Dirichlet process mixture model (DPMM), where the former defines the treatment selection model and the latter provides a flexible way to model the joint distribution of the predictors.

Results: We show that under missing-completely-at-random (MCAR) and missing-at-random (MAR) assumptions (with respect to the missing predictors), the DPMM can model complex relationships between predictor variables and predict missing values conditional on existing information. We also demonstrate that in the presence of multiple missing predictors, the DPMM can be used to explore which variable(s), if collected, would provide the most additional information about the likely outcome.

Conclusions: Our approach can provide practitioners with supplementary information to aid treatment selection decisions in the presence of missing data, and can be readily extended to other types of response model.

Key Messages:
- Missing predictor variables present a significant challenge when building and implementing prediction models in clinical practice.
- Removing individuals with missing information and performing a complete case analysis can lead to imprecision and bias. Multiple imputation approaches typically propagate uncertainty through prediction model parameter standard errors, as opposed to a consistent joint probability model.
- Alternatively, a Bayesian approach using Dirichlet process mixture models (DPMMs) offers a flexible way to model complex joint distributions of predictor variables, which can be used to estimate posterior (predictive) distributions for the missing predictors, conditional on the observed predictors.
- Using a DPMM in this way allows uncertainty around missing predictor data to be propagated through to a prediction model of interest using a Bayesian hierarchical framework. This allows prediction models to be developed from datasets with incomplete predictor information (assuming MCAR or MAR). Furthermore, predictions can be made for new individuals even if they have incomplete predictor information (under the same assumptions).
- This approach provides full posterior predictive distributions for both the missing predictor variables and the outcome variable, allowing a wide range of probabilistic model outputs to be derived to support clinical decision making.
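The flavour of DPMM-based conditional imputation can be sketched with scikit-learn's truncated Dirichlet process mixture as a frequentist stand-in for the authors' Bayesian model; a single conditional mean is computed here rather than the full posterior predictive distribution the paper provides.

```python
# Sketch: conditional imputation from a Dirichlet-process-style mixture, using
# scikit-learn's truncated DP mixture as a frequentist stand-in for a Bayesian
# DPMM. Imputes E[X1 | X0 = x0]; a full DPMM would return a posterior
# predictive distribution instead of a point value. Illustrative only.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(4)
n = 2000
z = rng.binomial(1, 0.5, n)                              # latent subgroup
X = np.column_stack([rng.normal(2 * z, 1), rng.normal(-z, 0.5)])

dpmm = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior_type="dirichlet_process",
    covariance_type="full", random_state=0).fit(X)

def impute_x1_given_x0(x0):
    """Mixture-of-Gaussians conditional mean of X1 given X0 = x0."""
    num = den = 0.0
    for w, mu, S in zip(dpmm.weights_, dpmm.means_, dpmm.covariances_):
        r = w * multivariate_normal(mu[0], S[0, 0]).pdf(x0)    # component weight
        cond_mean = mu[1] + S[1, 0] / S[0, 0] * (x0 - mu[0])   # Gaussian conditioning
        num += r * cond_mean
        den += r
    return num / den

print(round(impute_x1_given_x0(2.0), 2))   # near -1: x0=2 sits in the z=1 cluster
```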
Guski, J.; Aborageh, M.; Fröhlich, H.
Background: Targeted estimation offers a robust and unbiased approach for causal inference of the average treatment effect (ATE) from observational data, even with confounding, dependent censoring, and competing risks. Its advantages include double robustness, statistical rigor, and flexible data-adaptive modeling, potentially leveraging machine/deep learning. However, existing implementations lack model selection flexibility and are R-based, hindering adoption by the Python-focused machine learning community.

Results: We propose PyTMLE, a flexible Python package for causal machine learning-based targeted estimation with survival outcomes and competing risks. PyTMLE supports scikit-survival and pycox and provides built-in robustness checks based on E-values. PyTMLE is easy to use, with initial estimates of nuisance parameters obtained via super learning by default. We showcase its basic usage on the established Hodgkin's disease dataset, where our package reveals the protective effect of chemotherapy on relapse risk.

Conclusions: This package promotes targeted estimation in time-to-event analysis for applied machine learning, enabling fully data-adaptive nuisance parameter estimation, potentially with deep learning. Future enhancements may include time-dependent confounders and dynamic treatment regimes.
Woolf, B.; Pedder, H.; Rodriguez-Broadbent, H.; Edwards, P.
Objective: To assess the cost-effectiveness of using cheap-but-noisy outcome measures, such as a short and simple questionnaire.

Background: To detect associations reliably, studies must avoid bias and random error. To reduce random error, we can increase the size of the study and increase the accuracy of the outcome measurement process. However, with fixed resources there is a trade-off between the number of participants a study can enrol and the amount of information that can be collected on each participant during data collection.

Method: To consider the effect on measurement error of using outcome scales with varying numbers of categories, we define and calculate the Variance from Categorisation that would be expected from using a category midpoint; define the analytic conditions under which such a measure is cost-effective; use meta-regression to estimate the impact of participant burden, defined as questionnaire length, on response rates; and develop an interactive web app to allow researchers to explore the cost-effectiveness of using such a measure under plausible assumptions.

Results: Compared with no measurement, having even a few categories greatly reduced the Variance from Categorisation. For example, scales with five categories reduce the variance by 96% for a uniform distribution. We additionally show that a simple measure will be more cost-effective than a gold-standard measure if the relative increase in variance from using it is less than the relative increase in cost of the gold standard, assuming it does not introduce bias into the measurement. We found an inverse power law relationship between participant burden and response rates, such that doubling the burden on participants reduces the response rate by around one third. Finally, we created an interactive web app (https://benjiwoolf.shinyapps.io/cheapbutnoisymeasures/) to allow exploration of when a cheap-but-noisy measure will be more cost-effective under realistic parameter values.

Conclusion: Cheap-but-noisy questionnaires containing just a few questions can be a cost-effective way of maximising power. However, their use requires a judgment on the trade-off between the potential increase in the risk of information bias and the reduction in potential selection bias due to the expected higher response rates.

Key Messages:
- A cheap-but-noisy outcome measure, like a short-form questionnaire, is a more cost-effective method of maximising power than an error-free gold standard when the percentage increase in noise from using the cheap-but-noisy measure is less than the relative difference in the cost of administering the two alternatives.
- We have created an R Shiny app to facilitate exploration of when this condition is met, at https://benjiwoolf.shinyapps.io/cheapbutnoisymeasures/
- Cheap-but-noisy outcome measures are more likely to introduce information bias than a gold standard, but may reduce selection bias because they reduce loss to follow-up. Researchers therefore need to form a judgement about the relative increase or decrease in bias before using a cheap-but-noisy measure.
- We encourage the development and validation of short-form questionnaires to enable the use of high-quality cheap-but-noisy outcome measures in randomised controlled trials.
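The 96% figure quoted above can be checked directly: for a uniform variable cut into k equal categories, the Variance from Categorisation (the variance of the distance to the category midpoint) is 1/k² of the total variance. A quick numerical confirmation, with illustrative variable names:

```python
# Sketch: Variance from Categorisation (VFC) for a uniform score -- the variance
# of (X - category midpoint) when X is cut into k equal categories. The ratio
# to Var(X) is 1/k^2, so k = 5 leaves 4% (the 96% reduction quoted above).
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 1_000_000)
for k in (2, 3, 5, 10):
    mid = (np.floor(x * k) + 0.5) / k                 # midpoint of x's category
    print(k, round(np.var(x - mid) / np.var(x), 4))   # ~ 1/k**2
```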
Bouvarel, B.; Glemain, B.; Carrat, F.; Lapidus, N.
Background: Propensity score (PS) methods are widely used in observational studies to estimate causal effects, but they often exclude patients for lack of comparable counterparts, leading to reduced power and potential bias. Generative adversarial networks (GANs) have shown promise in creating synthetic data, but their application to causal inference remains underexplored. Synthetic data could be used as plausible counterfactuals, potentially mitigating these issues with PS methods. This study evaluates the integration of GAN-generated synthetic observations into propensity score matching (PSM) to improve the emulation of randomized controlled trials (RCTs), using both simulated and real-world electronic health record (EHR) data.

Methods: A simulation study was conducted using predefined confounding structures to compare traditional PSM against two hybrid approaches incorporating GAN-generated synthetic patients to partially or fully match the original sample of patients. Treatment effects were estimated via logistic regression, and performance was assessed by bias, standard error, alpha risk, power, and confidence interval coverage. The methods were then applied to a real-world dataset of mechanically ventilated COVID-19 patients to evaluate the impact of early prone positioning on 28-day mortality.

Results: In simulations, GAN-generated patients made it possible to match all patients in the original sample, whereas PSM dropped up to 60% of them. While synthetic augmentation improved sample size, unadjusted use of synthetic matches led to underestimated standard errors and inflated type I error. Down-weighting matched synthetic data improved error control but did not consistently outperform PSM in bias or power. In the real-world application (n=1399), treatment effect estimates for prone positioning were similar across all methods and did not reach statistical significance.

Conclusion: GAN-augmented propensity score matching can reduce sample loss. However, its current application to causal inference through PS matching remains limited. Synthetic data do not contribute independent information and must be integrated cautiously to avoid misleading precision. While promising, current GAN implementations require methodological refinement before routine use in causal inference.
Lora, D.; Leiva-Garcia, A.; Bernal, J. L.; Velez, J.; Palacios, B.; Villareal, M.; Capel, M.; Rosillo, N.; Hernandez, M.; Bueno, H.
Background: The statistical analysis of composite outcomes is challenging. The Clinical Outcomes, HEalthcare REsource utilizatioN, and relaTed costs (COHERENT) model was developed to describe and compare all components (incidence, timing, and duration) of composite outcomes, but its statistical analysis remained unsolved. The aim of this study is to assess a multi-state Markov model as one statistical solution for the COHERENT model.

Methods: A cohort of 3280 patients admitted to the emergency department or hospital for heart failure during 2018 was followed for one year. The state of each patient was registered at the end of each day for 365 days as: home, emergency department (ED), hospital, re-hospital, re-ED, or death. As an example, outcomes of patients with and without severe renal disease (sRD) were compared. A multi-state Markov model was developed to explain transitions to and from these states during follow-up.

Results: Adjusted for age and sex, the multi-state Markov model showed a significantly lower likelihood of patients with sRD returning home, regardless of the state they were in (ED → HOME: HR 0.72, 95% CI 0.54-0.95; RE-ED → HOME: HR 0.83, 95% CI 0.75-0.93; HOSPITAL → HOME: HR 0.77, 95% CI 0.69-0.86; RE-HOSPITAL → HOME: HR 0.82, 95% CI 0.74-0.92), and a higher mortality risk, in particular at the hospital and at home (HOME → Death: HR 1.54, 95% CI 1.01-2.37; HOSPITAL → Death: HR 1.71, 95% CI 1.30-2.24).

Conclusion: Multi-state Markov models offer a statistical solution for the comprehensive analysis of composite outcomes assessed as transitions between clinical states.

Clinical Perspective:
What is new?
- An integrated analysis of all components of composite endpoints, including their incidence and duration, is possible using the COHERENT model with analysis of transition risks.
- A statistical approach based on Markov chain models is a potential new solution for the multivariate estimation of transition risks in mutually exclusive composite endpoints.
What are the clinical implications?
- The COHERENT model combined with Markov models is an opportunity to analyze composite endpoints, to better understand the relationships between their components, and, potentially, to improve the performance of statistical analyses in randomized controlled trials.
- The use of the COHERENT model and Markov models should be validated in future observational studies and randomized controlled trials.
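A discrete-time simplification, illustrative only: with daily states recorded as above, the raw day-to-day transition matrix can be tallied directly. The paper's multi-state Markov model additionally adjusts transition intensities for covariates such as age and sex.

```python
# Sketch: tallying a daily transition matrix from state sequences, a
# discrete-time simplification of the multi-state model above. States coded
# 0..5; the random sequences merely stand in for the observed follow-up data.
import numpy as np

states = ["home", "ED", "hospital", "re-hospital", "re-ED", "death"]
rng = np.random.default_rng(11)
seq = rng.integers(0, 6, size=(100, 365))       # 100 patients x 365 daily states

counts = np.zeros((6, 6))
for s in seq:
    np.add.at(counts, (s[:-1], s[1:]), 1)       # count day-to-day transitions
P = counts / counts.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
print(np.round(P, 2))
```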
Vahdati, A.; Cotterill, S.; Marsden, A.; Kontopantelis, E.
Introduction: Electronic Health Records (EHRs) are vital repositories of patient information for medical research, but the prevalence of missing data presents an obstacle to the validity and reliability of research. This study aimed to review and categorise methods for handling missing data in EHRs, to help researchers better understand and address the challenges missing data pose.

Materials and Methods: This study employed scoping review methodology. Relevant literature, including review articles and original studies, was identified through systematic searches on EMBASE up to October 2023. After removal of duplicates, titles and abstracts were screened against inclusion criteria, followed by full-text assessment. Additional manual searches and reference list screenings were conducted. Data extraction focused on imputation techniques, dataset characteristics, assumptions about missing data, and article types. We also explored the availability of code in widely used software applications.

Results: We reviewed 101 articles, of which two were excluded as duplicates. Of the 99 remaining documents, 21 underwent full-text screening and nine were deemed eligible for data extraction. These articles introduced 31 imputation approaches classified into ten distinct methods, ranging from simple techniques such as complete case analysis to more complex methods such as multiple imputation, maximum likelihood, and the expectation-maximization algorithm. Machine learning methods were also explored. The different imputation methods present varying reliability. We identified a total of 32 packages across four software platforms (R, Python, SAS, and Stata) implementing these methods; notably, however, no packages implementing machine learning imputation methods were found for SAS or Stata. Of the nine imputation methods we investigated, package implementations were available in all four software platforms for seven.

Conclusions: Several methods to handle missing data in EHRs are available. These methods vary in complexity and make different assumptions about the missing data mechanisms. Knowledge gaps remain, notably in handling non-monotone missing data patterns and in implementing imputation methods in real-world healthcare settings under the missing-not-at-random assumption. Future research should prioritize refining and directly comparing existing methods.
Desai, R. J.; Wang, S.; Pillai, H. S.; Mahesri, M.; Gu, B.; Lii, J.; Dutcher, S. K.; Jones, C.; Shebl, F. M.; Bradley, M. C.; Hua, W.; Lee, H.; Dal Pan, G. J.; Ball, R.; Schneeweiss, S. S.
Background: Quantitative bias analyses often rely on unrealistic assumptions and do not fully reflect the complexities of healthcare data.

Methods: We describe a plasmode simulation-based bias analysis for residual confounding from unmeasured variables that leverages granular information from a subset of cohort members. We generated 500 simulated cohorts based on individual-level claims and linked electronic health record (EHR) data identifying new users of varenicline and bupropion from the Mass General Brigham site of the FDA Sentinel Real World Evidence Data Enterprise. Two adverse outcomes were simulated: (1) neuropsychiatric hospitalizations and (2) major adverse cardiovascular events (MACE). Measured confounding factors, identified from information available in claims including demographics, comorbid conditions, and comedications, were tailored to each outcome. Residual confounding was simulated using potential confounders measured in EHRs but unmeasured in claims, including suicidal ideation for the neuropsychiatric outcome and body mass index (BMI), blood pressure (BP), and smoking pack-years for the MACE outcome. These simulations retained the correlation between claims- and EHR-based confounders observed in empirical data, to realistically reflect proxy adjustment for unmeasured confounders. Analyses were conducted in the simulated data with and without adjustment for the EHR-based covariates to evaluate the extent of residual confounding in claims-only analyses.

Results: Across 500 simulations, the median absolute standardized mean difference (ASMD) between treatment groups in the unadjusted sample was 0.16 for suicidal ideation, and <0.1 for BMI, BP, and smoking pack-years. For both outcomes, adjustment using claims-based variables yielded relative bias close to 0, leading to the conclusion that EHR-measured confounders unmeasured in claims were unlikely to cause strong residual confounding within realistic simulations informed by empirical data.

Conclusion: The proposed approach provides a method for quantifying bias in non-randomized studies threatened by the unavailability of potentially important confounding variables.

Key points:
- Residual confounding by unmeasured factors is a central threat in pharmacoepidemiology that is almost always acknowledged in published studies but seldom quantified.
- We describe a plasmode simulation-based approach to systematically design quantitative bias analyses that reflect the complexities of routinely collected healthcare data by leveraging detailed electronic health records from a subset.
- We provide open-source software code to enable other researchers to adopt this method in future studies and improve the reliability of their findings.

Plain language summary: This study introduces a new way for researchers to better understand and measure bias caused by missing health information in large insurance databases. Using detailed hospital records alongside insurance claims data, we created realistic computer simulations to test how much of the observed risk in safety studies could be explained away by missing important health factors, like depression or smoking habits, that aren't always recorded in insurance data. The approach is flexible, uses real patient data, and helps researchers make stronger, more reliable conclusions about the risks and benefits of treatments, even when some patient information is not available in all records.
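The balance metric reported above, sketched under a common textbook definition (pooled-SD standardization); illustrative, not the authors' code.

```python
# Sketch: absolute standardized mean difference (ASMD) between treatment
# groups, using the common pooled-SD definition; illustrative only.
import numpy as np

def asmd(x, treat):
    x1, x0 = x[treat == 1], x[treat == 0]
    pooled_sd = np.sqrt((x1.var(ddof=1) + x0.var(ddof=1)) / 2)
    return abs(x1.mean() - x0.mean()) / pooled_sd

rng = np.random.default_rng(6)
treat = rng.binomial(1, 0.5, 1000)
x = rng.normal(0.2 * treat, 1)       # mild imbalance on a covariate
print(round(asmd(x, treat), 2))      # ~0.2; values above 0.1 commonly flag imbalance
```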
Curnow, E.; Carpenter, J. R.; Heron, J. E.; Cornish, R. P.; Rach, S.; Didelez, V.; Langeheine, M.; Tilling, K.
Background: Epidemiological studies often have missing data, and multiple imputation (MI) is a commonly used strategy for such studies. MI guidelines for structuring the imputation model have focused on compatibility with the analysis model, but not on the need for the (compatible) imputation model(s) to be correctly specified. Standard (default) MI procedures use simple linear functions. We examine the bias this causes and the performance of methods to identify problematic imputation models, providing practical guidance for researchers.

Methods: By simulation and real data analysis, we investigated how imputation model mis-specification affected MI performance, comparing results with complete records analysis (CRA). We considered scenarios in which imputation model mis-specification occurred because (i) the analysis model was mis-specified, or (ii) the relationship between exposure and confounder was mis-specified.

Results: Mis-specification of the relationship between outcome and exposure, or between exposure and confounder, in the imputation model for the exposure could result in substantial bias in CRA and MI estimates (in addition to any bias in the full-data estimate due to analysis model mis-specification). MI by predictive mean matching could mitigate model mis-specification. Model mis-specification tests were effective in identifying mis-specified relationships, and could easily be applied in any setting in which CRA was, in principle, valid and data were missing at random (MAR).

Conclusion: When using MI methods that assume data are MAR, compatibility between the analysis and imputation models is necessary, but not sufficient, to avoid bias. We propose an easy-to-follow, step-by-step procedure for identifying and correcting mis-specification of imputation models.
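A sketch of why predictive mean matching (PMM) can mitigate a mis-specified imputation model: imputations are borrowed from observed donors, so they stay on the support of the data even when the model for the predicted means is wrong. Single-draw PMM under assumed MCAR missingness; illustrative only.

```python
# Sketch: single-draw predictive mean matching (PMM) for a missing exposure
# under a deliberately mis-specified (linear) imputation model.
import numpy as np

rng = np.random.default_rng(7)
n = 1000
conf = rng.normal(size=n)
expo = conf**2 + rng.normal(0, 0.5, n)      # truth is nonlinear in the confounder
miss = rng.random(n) < 0.3                  # MCAR missingness in the exposure

beta = np.polyfit(conf[~miss], expo[~miss], 1)   # mis-specified linear model
pred = np.polyval(beta, conf)

def pmm(pred, expo, miss, k=5):
    """Impute each missing value with a random draw from its k nearest donors."""
    donors = np.flatnonzero(~miss)
    out = expo.copy()
    for i in np.flatnonzero(miss):
        nearest = donors[np.argsort(np.abs(pred[donors] - pred[i]))[:k]]
        out[i] = expo[rng.choice(nearest)]
    return out

imputed = pmm(pred, expo, miss)
# Imputed values never fall outside the observed range, unlike model draws:
print(round(imputed[miss].min(), 2), ">=", round(expo[~miss].min(), 2))
```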
Kawabata, E.; Shapland, C. Y.; Palmer, T. M.; Carslake, D.; Tilling, K.; Hughes, R.
Background: Unmeasured confounding is a persistent concern in observational studies. Its impact can be assessed quantitatively using a quantitative bias analysis (QBA). A QBA specifies the relationship between the unmeasured confounder(s), U, and the study data via its bias parameters. There are two broad classes of QBA methods: deterministic and probabilistic. We focus on probabilistic QBA, which incorporates external information about U via prior distribution(s) placed on these bias parameters and can be implemented as a Bayesian QBA or a Monte Carlo QBA. A Bayesian QBA combines the prior distribution(s) with the data's likelihood function, whilst a Monte Carlo QBA samples the bias parameters directly from their prior distributions. Software implementations of probabilistic QBAs for unmeasured confounding are scarce and mainly limited to unadjusted analyses of a binary exposure and outcome. One exception is the R package unmconf (Hebdon et al. 2024, BMC Med. Res. Methodol., https://doi.org/10.1186/s12874-024-02322-2), which implements a Bayesian QBA applicable when the analysis is a generalised linear model (GLM). However, for a study with q measured confounders and a single U, unmconf requires information on at least 3 + q bias parameters, which is burdensome when q > 1 and validation data are unavailable.

Aim: We propose a flexible Monte Carlo QBA in which the number of bias parameters is independent of the number of measured confounders. It is applicable to a GLM or a proportional hazards survival model, with binary, continuous, or categorical exposure and measured confounders, and one or multiple (≥ 2) binary or continuous unmeasured confounders.

Methods: Via simulations, we evaluated our Monte Carlo QBA for different analyses (e.g., varying the regression model and the type of exposure and unmeasured confounder variables) and different levels of dependency between the measured and unmeasured confounders. Also, using our proposed bias model, we compared a Monte Carlo implementation to a fully Bayesian implementation when the analysis is a linear or logistic regression. We repeated the simulation study for prior distributions with different levels of informativeness.

Results: Ignoring U resulted in substantially biased estimates with substantial confidence interval undercoverage (e.g., 57%). Our Monte Carlo QBA (with informative priors) resulted in unbiased (or minimally biased) point estimates and interval estimates with close to nominal coverage. For binary U, levels of bias were marginally higher when U was strongly correlated with the measured confounders. The performance of the Monte Carlo and Bayesian implementations was comparable.

Conclusion: We have proposed a flexible probabilistic QBA for unmeasured confounding that is applicable to a wide range of regression-based analyses. We have minimised the burden placed on the user by limiting the number of bias parameters and avoiding the need for specialist knowledge of Bayesian inference or Bayesian software. Our proposed Monte Carlo QBA will be implemented as a Stata command and an R package, qbaconfound.
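The Monte Carlo idea can be sketched with the classic simple bias formula for a risk ratio; the authors' bias model is regression-based and more general, so treat the formula, priors, and parameter values here as illustrative assumptions.

```python
# Sketch: Monte Carlo bias analysis for one binary unmeasured confounder U,
# using the classic simple bias factor for a risk ratio,
#   B = (RR_UD * p1 + 1 - p1) / (RR_UD * p0 + 1 - p0),   RR_adj = RR_obs / B,
# with bias parameters drawn from priors. Illustrative values throughout.
import numpy as np

rng = np.random.default_rng(8)
rr_obs = 1.8                                       # observed (confounded) risk ratio
draws = 100_000
rr_ud = rng.lognormal(np.log(2.0), 0.2, draws)     # prior: U-outcome risk ratio
p1 = rng.beta(6, 14, draws)                        # prior: P(U=1 | exposed)
p0 = rng.beta(2, 18, draws)                        # prior: P(U=1 | unexposed)

bias = (rr_ud * p1 + 1 - p1) / (rr_ud * p0 + 1 - p0)
rr_adj = rr_obs / bias
print(np.round(np.percentile(rr_adj, [2.5, 50, 97.5]), 2))   # adjusted interval
```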
Khachadourian, V.; Janecka, M.
Introduction: Despite theoretical advancements and recommendations regarding covariate adjustment in causal inference, clinical studies often fail to explicitly state the assumed causal structure among the study variables. In particular, despite the pervasive nature of comorbidity, explicit causal assumptions about the role of comorbidity in exposure-outcome relationships are often lacking, potentially leading to inappropriate accounting for comorbid conditions and biased effect estimates. This study aims to explore common causal structures involving comorbidity and provide guidance for handling it in etiologic research.

Methods: We use directed acyclic graphs (DAGs) to depict six causal scenarios involving comorbidity as a confounder, mediator, collider, or consequence of the exposure or outcome. Simulations were conducted across 5,000 iterations for each scenario, assessing the impact of conditioning on comorbidity under three effect measures (mean difference, odds ratio, risk ratio). Bias was evaluated by comparing adjusted and unadjusted effect estimates to the true values.

Results: The impact of conditioning on comorbidity varied with its causal role. Adjusting for comorbidity mitigated bias when it acted as a confounder, but introduced bias when it was a mediator or collider. Where comorbidity was a consequence of either the exposure or the outcome, the decision to adjust depended on the research objectives. Nonlinear models revealed differences between marginal and conditional effects due to non-collapsibility.

Discussion: Explicit causal assumptions are essential for selecting appropriate analytical strategies in etiologic research. This study provides practical guidance on handling comorbidity-related challenges, highlighting the need for study design and analysis to align with research objectives. Future work should address more complex causal structures and other methodological challenges.
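A minimal simulation of the collider scenario, assuming statsmodels: exposure and outcome are independent, both cause the comorbidity, and adjusting for it manufactures an association.

```python
# Sketch: conditioning on a comorbidity that is a collider. Exposure and
# outcome are independent; both cause the comorbidity. Adjusting for it
# manufactures a spurious association. Illustrative simulation only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 100_000
exposure = rng.normal(size=n)
outcome = rng.normal(size=n)                            # true exposure effect: 0
comorbidity = exposure + outcome + rng.normal(size=n)   # collider

unadj = sm.OLS(outcome, sm.add_constant(exposure)).fit().params[1]
adj = sm.OLS(outcome, sm.add_constant(
    np.column_stack([exposure, comorbidity]))).fit().params[1]
print(f"unadjusted: {unadj:+.3f}   adjusted for collider: {adj:+.3f}")
# unadjusted ~ 0 (unbiased); adjusting opens the collider path (~ -0.5 here)
```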
Lusa, L.; Kappenberg, F.; Collins, G. S.; Schmid, M.; Sauerbrei, W.; Rahnenfuehrer, J.
The number of prediction models proposed in the biomedical literature has been growing year on year, and in the last few years the changes occurring in the prediction modelling landscape have attracted increasing attention. It has been suggested that machine learning techniques are becoming more popular for developing prediction models, exploiting complex data structures, higher-dimensional predictor spaces, very large numbers of participants, and heterogeneous subgroups, with the ability to capture higher-order interactions. We examine these changes in modelling practice by investigating a selection of systematic reviews of prediction models published in the biomedical literature. We selected systematic reviews published since 2020 that included at least 50 prediction models, and extracted information guided by the CHARMS checklist. Time trends were explored using the models published since 2005. We identified 8 reviews, which included 1448 prediction models published in 887 papers. The average number of study participants and outcome events increased considerably between 2015 and 2019, but remained stable afterwards. The number of candidate and final predictors did not noticeably increase over the study period, with a few recent studies using very large numbers of predictors. Internal validation and reporting of discrimination measures became more common, but assessing calibration and carrying out external validation remained less common. Information about missing values was not reported in about half of the papers, although the use of imputation methods increased. There was no sign of an increase in the use of machine learning methods. Overall, most findings were heterogeneous across reviews. Our results indicate that changes in the prediction modelling landscape in biomedicine are less dramatic than expected and that poor reporting is still common. Adherence to the well-established best practice recommendations of the traditional biostatistics literature is still inadequate; for machine learning, such recommendations are largely still missing.
Cid Royo, A.; Elbers, R.; Weibel, D.; Hoxhaj, V.; Kurkcuoglu, Z.; Sturkenboom, M. C.; Vaz, T. A.; Andaur Navarro, C. L.
Methods: Several statistical analysis plans (SAPs) from the Vaccine Monitoring Collaboration for Europe (VAC4EU) were analyzed to identify the study design sections and specifications needed for programming RWE studies based on multiple databases standardized to common data models. We envisioned a metadata schema that transforms the epidemiologist's knowledge into a machine-readable format. This machine-readable metadata schema must contain the study sections, code lists, and time anchoring specified in the SAPs; further desired attributes are adaptability and user-friendliness.

Results: We developed RWE-BRIDGE, a metadata schema with a star-schema model divided into four study design sections comprising 12 tables: Study Variable Definition (two tables), Cohort Definition (two tables), Post-Exposure Outcome Analysis (one table), and Data Retrieval (seven tables). We provide examples and a step-by-step guide to populating the schema, as well as a Shiny app that checks the tables it proposes. RWE-BRIDGE is available at https://github.com/UMC-Utrecht-RWE/RWE-BRIDGE.

Discussion: RWE-BRIDGE has been designed to support the translation of study design sections from statistical analysis plans into analytical pipelines, facilitating collaboration and transparency between lead researchers and scientific programmers and reducing hard-coding and repetition. The metadata schema is flexible, supporting different common data models and programming languages, and is adaptable to the specific needs of each SAP by adding further tables or fields where necessary. Modified versions of RWE-BRIDGE have been applied in several RWE studies within the VAC4EU ecosystem.

Conclusion: RWE-BRIDGE offers a systematic approach to detailing the variables, time anchoring, and algorithms required for a specific RWE study. Applying this metadata schema can facilitate transparent communication between epidemiologists and programmers.
Domingo-Relloso, A.; Jerolon, A.; Tellez-Plaza, M.; Bermudez, J. D.
Objective: The potential intermediate effect of several variables on the association between an exposure and a time-to-event outcome is a question of interest in epidemiologic research. However, to our knowledge, no tools have been developed for the evaluation of multiple correlated mediators in a survival setting.

Methods: In this work, we extended the multimediate algorithm, which conducts mediation analysis in the context of multiple correlated mediators with no causal ordering among them, to a time-to-event setting using the semiparametric additive hazards model. We theoretically demonstrated that, under certain assumptions, indirect, direct, and total effects can be calculated using the counterfactual framework with collapsible survival models. We also adapted the algorithm to accommodate exposure-mediator interactions.

Results and conclusions: Using simulations, we demonstrated that our algorithm performs better than the product-of-coefficients method, even for uncorrelated mediators. The additive hazards model quantifies effects as rate differences, which constitute a measure of impact with applications that can be highly informative for public health. Our algorithm is included in the R package multimediate, which is available on GitHub.
Kellerhuis, B. E.; Jenniskens, K.; Schuit, E.; Hooft, L.; Moons, C.; Reitsma, J. B.
Objectives: To assess the impact of study and expert panel characteristics on estimates of index test diagnostic accuracy.

Study Design and Setting: Simulations were performed in which an expert panel was used as the reference standard to estimate the sensitivity and specificity of an index diagnostic test. The reference standard was determined by combining experts' probability estimates of target condition presence, informed by four component reference tests, through a predefined threshold. Study and panel characteristics were varied across scenarios: target condition prevalence (20%, 40%, 50%), accuracy of component reference tests (70%, 80%, mixed), expert panel size (2, 3, 10), study population size (360, 1000), and random or systematic differences between experts' probability estimates. Bias in accuracy estimates across all possible true index test values was quantified for all scenarios, and the total bias in each scenario was summarised by the mean squared error (MSE).

Results: When estimating an index test with 80% sensitivity and 70% specificity, bias in these estimates was hardly affected by the study population size or the number of experts. When one expert was systematically biased, bias in sensitivity and specificity estimates increased, but this effect lessened with a larger expert panel. Prevalence had a large effect on bias: scenarios with a prevalence of 0.5 estimated sensitivity between 63.3% and 76.7% and specificity between 56.1% and 68.7%, whereas scenarios with a prevalence of 0.2 estimated sensitivity between 48.5% and 73.3% and specificity between 65.5% and 68.7%. Random and systematic differences between experts also increased bias, with estimated sensitivity between 48.6% and 77.4% and specificity between 59.1% and 69.1%, as opposed to scenarios without random or systematic differences, which estimated sensitivity between 58.0% and 77.4% and specificity between 56.1% and 69.1%. More accurate component reference tests also reduced bias: scenarios with four component tests of 80% sensitivity and specificity estimated index test sensitivity between 60.1% and 77.4% and specificity between 62.9% and 69.1%, whereas scenarios with four component tests of 70% sensitivity and specificity estimated index test sensitivity between 48.5% and 73.4% and specificity between 56.1% and 67.0%.

Conclusion: Bias in accuracy estimates obtained with an expert panel increases when the component reference tests (combined) are less accurate. Prevalence, the true value of the index test accuracy, and random or systematic differences between experts can also affect the amount of bias, and both the amount and the direction vary between scenarios.
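A stripped-down sketch of the panel mechanics, illustrative only: experts' noisy probability estimates are averaged and thresholded to form the reference standard, and the resulting reference misclassification attenuates the estimated accuracy of the index test. All values below are assumptions for illustration, not the paper's scenarios.

```python
# Sketch: an expert panel reference standard built by averaging noisy
# probability estimates and thresholding at 0.5; the resulting reference
# misclassification attenuates index test accuracy estimates.
import numpy as np

rng = np.random.default_rng(10)
n, n_experts, prev = 100_000, 3, 0.4
disease = rng.binomial(1, prev, n)                         # true target condition
p_pos = np.where(disease == 1, 0.8, 0.3)                   # index: Se 0.80, Sp 0.70
index = (rng.random(n) < p_pos).astype(int)

evidence = disease[None, :] + rng.normal(0, 0.6, (n_experts, n))  # experts' signals
panel_ref = (evidence.mean(axis=0) > 0.5).astype(int)      # panel reference standard

se = index[panel_ref == 1].mean()
sp = 1 - index[panel_ref == 0].mean()
print(f"estimated Se {se:.2f} (true 0.80), Sp {sp:.2f} (true 0.70)")
```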